Assigning mutational signatures to individual samples and individual somatic mutations with SigProfilerAssignment

Abstract

Motivation: Analysis of mutational signatures is a powerful approach for understanding the mutagenic processes that have shaped the evolution of a cancer genome. To evaluate the mutational signatures operative in a cancer genome, one first needs to quantify their activities by estimating the number of mutations imprinted by each signature.

Results: Here we present SigProfilerAssignment, a desktop and an online computational framework for assigning all types of mutational signatures to individual samples. SigProfilerAssignment is the first tool that allows both analysis of copy-number signatures and probabilistic assignment of signatures to individual somatic mutations. As its computational engine, the tool uses a custom implementation of the forward stagewise algorithm for sparse regression and nonnegative least squares for numerical optimization. Analysis of 2700 synthetic cancer genomes with and without noise demonstrates that SigProfilerAssignment outperforms four commonly used approaches for assigning mutational signatures.

Availability and implementation: SigProfilerAssignment is available under the BSD 2-clause license at https://github.com/AlexandrovLab/SigProfilerAssignment with a web implementation at https://cancer.sanger.ac.uk/signatures/assignment/.


Description of SigProfilerAssignment's algorithm
Mathematically, a mutational schema can be represented as a finite alphabet $\Xi$ of mutation types containing a total of $\xi$ letters. Here, a mutational signature is defined as a probability mass function whose domain is the alphabet $\Xi$. In vector notation, a mutational signature can be denoted as $\vec{s} = (s_1, s_2, \ldots, s_\xi)^T$, where $s_i$, $1 \le i \le \xi$, is the probability for the mutational signature, $\vec{s}$, to cause mutations of the type corresponding to the $i^{th}$ letter of the alphabet $\Xi$. Since a mutational signature is a probability mass function, $0 \le s_i \le 1$ and $\sum_{i=1}^{\xi} s_i = 1$. As such, a set of $n$ known mutational signatures can be expressed as a signature matrix, $S \in \mathbb{R}_{+}^{\xi \times n}$, where $S = [\vec{s}_1, \vec{s}_2, \ldots, \vec{s}_n]$. Further, a set of mutations in a cancer genome can be defined as $m: \Xi \to \mathbb{N}_0^{\xi}$. In vector notation, a set of mutations in a cancer genome is $\vec{m} = (m_1, m_2, \ldots, m_\xi)^T$, where $m_i$, $1 \le i \le \xi$, reflects the number of mutations in that cancer genome of the mutation type corresponding to the $i^{th}$ letter of the alphabet $\Xi$.
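To make these definitions concrete, the following minimal sketch represents a signature as a probability mass function and a sample as a vector of counts. The six-letter alphabet, the function names, and the example values are illustrative assumptions and are not part of SigProfilerAssignment.

```python
from collections import Counter

# Hypothetical toy alphabet with six mutation types (an assumption for
# illustration; real schemas contain many more letters).
ALPHABET = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def is_signature(s, tol=1e-9):
    """Return True if s is a probability mass function over ALPHABET."""
    return (
        len(s) == len(ALPHABET)
        and all(0.0 <= p <= 1.0 for p in s)
        and abs(sum(s) - 1.0) <= tol
    )


def mutation_vector(mutations):
    """Count the mutations of each type, ordered as in ALPHABET."""
    counts = Counter(mutations)
    return [counts[t] for t in ALPHABET]


# Illustrative values only.
signature = [0.10, 0.05, 0.60, 0.05, 0.15, 0.05]
observed = ["C>T", "C>T", "T>C", "C>A", "C>T"]
m = mutation_vector(observed)
```

Here `m` plays the role of the mutation vector of one sample, and a signature matrix would simply collect several such `signature` vectors as columns.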
SigProfilerAssignment takes as input a signature matrix, $S$, and a set of mutations, $\vec{m}$, and outputs a column vector of activities $\vec{a} = [a_1, a_2, \ldots, a_n]^T$, where $a_j \in \mathbb{N}_0$, $1 \le j \le n$, corresponds to the number of somatic mutations attributed to the $j^{th}$ mutational signature. The underlying assumption of assigning mutational signatures is that the mutations within a sample can be approximated as a superposition of known mutational signatures and their activities:

$$\vec{m} \approx S \times \vec{a} \tag{1}$$

Thus, subject to $\vec{a} \ge 0$, one needs to derive the vector $\vec{a}$ that best fits the provided input data. To solve this optimization problem, SigProfilerAssignment uses a custom implementation of the forward stagewise algorithm (Hastie, et al., 2009) and applies nonnegative least squares (NNLS) based on the Lawson-Hanson method (Lawson and Hanson, 1977):

$$\vec{a} = \underset{\vec{a} \ge 0}{\arg\min} \, \lVert \vec{m} - S\vec{a} \rVert_2 \tag{2}$$

The algorithm starts by first computing a minimum relative error, $\epsilon_{min}$, by deriving the optimal nonnegative vector $\vec{a}$ for the complete set of all reference signatures, $S$, using equation (2). This minimum error provides the best possible explanation of the data, but it also results in overfitting as all available signatures are utilized. Next, the tool uses steps for removing and adding signatures based on the backward and forward stepwise algorithms, respectively (Hastie, et al., 2009). First, signatures are removed by employing a backward stepwise algorithm (Hastie, et al., 2009) [...] decrease of the error rate is added back to the signature set, provided that the increase is more than a specific threshold (default value of 0.05). After the final addition of the signature with the greatest relative rate decrease, the minimum relative error, $\epsilon_{min}$, and the set of signatures are updated to reflect this addition. The addition step is repeated until all signatures satisfying the conditions are added back to the set. Lastly, the addition and removal steps are repeated until convergence, where no signature is added or removed from the list of signatures (Algorithm 1).
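The refitting loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the tool's implementation: it relies on `scipy.optimize.nnls` rather than the custom code, it compares absolute differences in the relative L2 error, the removal threshold `remove_thr` is a hypothetical parameter, and the activities are left as real numbers instead of being converted to integer mutation counts. Only the addition threshold of 0.05 is taken from the text.

```python
import numpy as np
from scipy.optimize import nnls


def fit(S, m, cols):
    """NNLS fit of m on the signature columns `cols`.

    Returns (activities for those columns, relative L2 error).
    """
    if not cols:
        return np.zeros(0), 1.0
    a, residual_norm = nnls(S[:, cols], m)
    return a, residual_norm / np.linalg.norm(m)


def refit(S, m, remove_thr=0.01, add_thr=0.05, max_rounds=100):
    n = S.shape[1]
    active = list(range(n))
    _, err = fit(S, m, active)  # minimum error: every reference signature used

    for _ in range(max_rounds):  # cap on rounds is a safety assumption
        changed = False
        # Removal: drop the signature whose loss raises the error least,
        # while that rise stays below remove_thr (assumed criterion).
        while len(active) > 1:
            new_err, j = min(
                (fit(S, m, [c for c in active if c != k])[1], k) for k in active
            )
            if new_err - err >= remove_thr:
                break
            active.remove(j)
            err, changed = new_err, True
        # Addition: add back the signature giving the largest error decrease,
        # provided the decrease exceeds add_thr.
        while len(active) < n:
            new_err, j = min(
                (fit(S, m, sorted(active + [k]))[1], k)
                for k in range(n) if k not in active
            )
            if err - new_err <= add_thr:
                break
            active = sorted(active + [j])
            err, changed = new_err, True
        if not changed:  # convergence: nothing added or removed
            break

    activities = np.zeros(n)
    fitted, err = fit(S, m, active)
    activities[active] = fitted
    return activities, err


# Toy example: three signatures over four mutation types (illustrative values).
S_demo = np.array([[0.7, 0.1, 0.1],
                   [0.1, 0.7, 0.1],
                   [0.1, 0.1, 0.1],
                   [0.1, 0.1, 0.7]])
m_demo = S_demo @ np.array([100.0, 0.0, 50.0])
a_demo, err_demo = refit(S_demo, m_demo)
```

In this toy case the second signature contributes nothing to the sample, so it is removed in the backward step and is not added back because it does not reduce the error.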
In addition to quantifying the activity of each mutational signature, SigProfilerAssignment also assigns known signatures to individual mutations (Fig. 1B) based on their specific mutational context:

$$P_{ij} = \frac{s_{ij} \, a_j}{[S\vec{a}]_i}$$
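As a worked illustration of this assignment, the sketch below computes, for each mutation type, the share of the reconstructed count contributed by each signature. The function name and the toy numbers are assumptions made for the example; the handling of mutation types with a zero reconstructed count is also an assumption.

```python
def mutation_type_probabilities(S, a):
    """Probability that a mutation of each type was caused by each signature.

    S: list of rows, one per mutation type, each listing the probability of
       that type under every signature.
    a: list of activities, one per signature.
    """
    probabilities = []
    for row in S:
        contributions = [s_ij * a_j for s_ij, a_j in zip(row, a)]
        total = sum(contributions)  # the reconstructed count for this type
        # Assumption: types with no reconstructed mutations get zeros.
        probabilities.append(
            [c / total if total > 0 else 0.0 for c in contributions]
        )
    return probabilities


# Two mutation types, two signatures (illustrative values).
S_toy = [[0.7, 0.1],
         [0.3, 0.9]]
a_toy = [100, 50]
P_toy = mutation_type_probabilities(S_toy, a_toy)
```

For the second mutation type, the first signature contributes 30 of the 75 reconstructed mutations, so a mutation of that type is attributed to it with probability 0.4.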

Distribution and Usage
SigProfilerAssignment is distributed as a Python package and it is available under a permissive

Benchmarking of bioinformatics tools for refitting known mutational signatures
To evaluate the performance of tools for refitting known mutational signatures, we used a standard set of evaluation metrics and compared SigProfilerAssignment with four other commonly used approaches: deconstructSigs (Rosenthal, et al., 2016), MutationalPatterns (Blokzijl, et al., 2018; Manders, et al., 2022), sigLASSO (Li, et al., 2020), and SignatureToolsLib (Degasperi, et al., 2020; Degasperi, et al., 2022). Specifically, each tool was applied to 2,700 previously simulated cancer genomes (Islam, et al., 2022), corresponding to 300 simulated tumors from each of nine different cancer types: bladder transitional cell carcinoma, esophageal adenocarcinoma, breast adenocarcinoma, lung squamous cell carcinoma, renal cell carcinoma, ovarian adenocarcinoma, osteosarcoma, cervical adenocarcinoma, and stomach adenocarcinoma. The cancer genomes of these samples were simulated using 21 different COSMIC SBS reference signatures. To emulate a typical refitting of mutational signatures, each tool was applied by utilizing the complete set of 79 COSMICv3.3 SBS signatures. After assigning the signatures, the assignment of each signature to each sample was classified as either a true positive (TP), false positive (FP), or false negative (FN) result. A known signature was considered TP if at least one mutation was assigned to the signature by a particular tool and the ground truth activity of the signature was greater than zero. In contrast, a signature was classified as FP when it was assigned by a tool, but the ground truth activity was zero. Lastly, FN results were signatures with ground truth activities above zero that were not assigned any somatic mutation. These standard metrics allowed calculating the precision, sensitivity, and F1 score of each tool per sample, defined as:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$$

These metrics were calculated for each synthetically generated sample and, subsequently, averaged to obtain a final accuracy value for each random noise level (0%, 5%, and 10%).
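The per-sample classification and the resulting metrics can be sketched as below. The function name, the dictionary-based inputs, the example activities, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def refit_metrics(truth, assigned):
    """Precision, sensitivity, and F1 for one sample.

    truth, assigned: dicts mapping signature name -> number of mutations.
    Signatures missing from a dict are treated as having zero activity.
    """
    names = set(truth) | set(assigned)
    tp = sum(1 for s in names if truth.get(s, 0) > 0 and assigned.get(s, 0) > 0)
    fp = sum(1 for s in names if truth.get(s, 0) == 0 and assigned.get(s, 0) > 0)
    fn = sum(1 for s in names if truth.get(s, 0) > 0 and assigned.get(s, 0) == 0)

    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    if precision + sensitivity:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f1 = 0.0
    return precision, sensitivity, f1


# Illustrative sample: two signatures recovered, one missed, one spurious.
truth_demo = {"SBS1": 120, "SBS5": 80, "SBS2": 30}
assigned_demo = {"SBS1": 110, "SBS5": 90, "SBS40": 30}
precision_demo, sensitivity_demo, f1_demo = refit_metrics(truth_demo, assigned_demo)
```

Averaging these per-sample values over all samples at a given noise level would give the summary values described in the text.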
For the ID and DBS benchmarking, synthetic mutational profiles were generated following the same methodology used for constructing the previously published SBS dataset (Islam, et al., 2022), using the GenerateSyntheticTumors function of the SynSigGen R package (https://github.com/steverozen/SynSigGen). This package uses the original activities from the PCAWG analysis of mutational signatures (Alexandrov, et al., 2020) to derive synthetic mutational profiles per cancer type. This simulation process requires that at least two different signatures are assigned to each sample from every specific cancer type. Considering this, we generated synthetic datasets for DBS and ID variant classes using the same nine cancer types previously used in the SBS benchmarking (300 simulated samples from each cancer type), including bladder transitional cell carcinoma, esophageal adenocarcinoma, breast adenocarcinoma, lung squamous cell carcinoma, renal cell carcinoma, ovarian adenocarcinoma, osteosarcoma, cervical adenocarcinoma, and stomach adenocarcinoma. However, due to the limitation mentioned above, cervical adenocarcinoma was removed for the synthetic ID profile generation since only ID1 was present in the original activities of the PCAWG samples. For the generation of the synthetic DBS dataset, cervical adenocarcinoma was also removed (only DBS4 was assigned to one of the PCAWG samples), along with lung squamous cell carcinoma, as only the tobacco-associated DBS2 signature was assigned to several of the PCAWG cases. In summary, 2,100 synthetic DBS samples and 2,400 synthetic ID samples were generated (300 for each of the seven and eight cancer types, respectively). In the case of copy number alterations, since this mutation type was not supported by SynSigGen, the pan-cancer activities from the original publication describing the COSMICv3.3 CN signatures (Steele, et al., 2022) were used and multiplied by the reference signatures to get a synthetic dataset encompassing 9,699 synthetic samples from 33 different cancer types. Regarding the input set of known mutational signatures, in all three cases the most recent COSMICv3.3 version of the reference signatures was used, including 18 ID, 11 DBS, and 24 CN signatures.
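The construction of the synthetic CN dataset, in which activities are multiplied by the reference signatures, amounts to a matrix product. The sketch below illustrates it with toy numbers; the function name, the list-of-lists layout, and the values are assumptions, and no noise or rounding step is applied here.

```python
def synthetic_profiles(signatures, activities):
    """Multiply a signature matrix by an activity matrix.

    signatures: rows = mutation types, columns = signatures.
    activities: rows = signatures, columns = samples.
    Returns a matrix with rows = mutation types and columns = samples.
    """
    n_signatures = len(activities)
    n_samples = len(activities[0])
    return [
        [
            sum(row[j] * activities[j][k] for j in range(n_signatures))
            for k in range(n_samples)
        ]
        for row in signatures
    ]


# Three mutation types, two signatures, two samples (illustrative values).
signatures_demo = [[0.5, 0.1],
                   [0.3, 0.2],
                   [0.2, 0.7]]
activities_demo = [[100, 0],
                   [50, 200]]
profiles_demo = synthetic_profiles(signatures_demo, activities_demo)
```

Because each signature sums to one, the total count of every synthetic sample equals the sum of its activities (150 and 200 in this example).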
To benchmark the computational performance of the different bioinformatics tools, their CPU elapsed time and peak memory usage were monitored and averaged for the three noise levels.
SigProfilerAssignment v0.0.28 was run using default parameters. deconstructSigs (Rosenthal, et al., 2016) [...]

Figure S1. Tissue type-specific benchmarking of SigProfilerAssignment and four other tools for assigning mutational signatures. The F1 scores (harmonic mean of precision and sensitivity) for the nine cancer types included in the synthetic dataset (300 simulated genomes for each cancer type) were used to evaluate the accuracy of the signature assignment across the simulations with non-systematic random noise.

Synthetic DBS, ID, and CN samples were used to test the accuracy of SigProfilerAssignment signature assignment using COSMICv3.3 signatures as input known mutational signatures. Three different levels of non-systematic random noise (0%, 5%, and 10%) were used to evaluate the precision (x-axes), sensitivity (y-axes), and F1 scores (harmonic mean of precision and sensitivity; red dotted lines) of each tool.

For the assignment of signatures to individual mutations, $P_{ij}$ represents the probability of a mutation corresponding to the $i^{th}$ letter of the alphabet $\Xi$ being caused by the $j^{th}$ signature in the sample; $s_{ij}$ is the probability of the $j^{th}$ signature to cause a mutation corresponding to the $i^{th}$ letter of the alphabet $\Xi$; $a_j$ is the number of mutations attributed to the $j^{th}$ mutational signature; and $[S\vec{a}]_i$ is the value of the $i^{th}$ element of the vector obtained by the matrix multiplication of the signature matrix, $S$, and the derived signature activities, $\vec{a}$.

The main output of SigProfilerAssignment includes the activity of each known mutational signature for each of the supplied samples, the reconstruction of the original dataset, and the probability of each individual mutation being caused by a specific signature. The latter is not provided when the input file is a mutational vector or mutational matrix, as this input format lacks information about individual somatic mutations. Signature activities correspond to the specific numbers of mutations from the original catalog caused by a particular mutational process. Considering these activities, as well as the provided set of known mutational signatures, a reconstruction of the original mutational catalog for each sample is derived. Different accuracy metrics for this reconstruction are reported by SigProfilerAssignment, including cosine similarity, Kullback-Leibler divergence, Pearson correlation, L1 relative error, and L2 relative error. The signature assignment results are summarized using three independent visualizations: (i) a bar plot depicting the activities of all mutational signatures within a sample; (ii) a tumor mutational burden (TMB) signature plot showing the activities per mutational signature; and (iii) an individual reconstruction plot per sample, which includes the mutational profiles for both the original and the reconstructed input sample, different accuracy metrics, and the mutational profiles for each of the known mutational signatures assigned to that sample. For the online version of the tool, an interactive heatmap plot, including the signatures' activities and the samples' reconstruction accuracies, is also provided. Raw data files containing activities, reconstruction metrics, and signature probabilities for individual mutations are generated by the desktop tool and can be downloaded from the online version.
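The reconstruction accuracy metrics reported by the tool (cosine similarity, Kullback-Leibler divergence, Pearson correlation, and the L1 and L2 relative errors) can be illustrated with the following sketch. These are textbook definitions written for illustration; the exact normalizations used by SigProfilerAssignment are not stated in the text, so the choice of dividing by the norm of the original profile and the `eps` guard in the divergence are assumptions.

```python
import math


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def pearson_correlation(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def kl_divergence(x, y, eps=1e-12):
    """KL divergence of the normalized original x from the normalized y.

    eps guards against zero entries in y (an assumption of this sketch).
    """
    p = [a / sum(x) for a in x]
    q = [b / sum(y) for b in y]
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def l1_relative_error(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(abs(a) for a in x)


def l2_relative_error(x, y):
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return diff / math.sqrt(sum(a * a for a in x))


# Original and reconstructed profiles over four mutation types (illustrative).
original = [75, 15, 15, 45]
reconstructed = [70, 20, 15, 45]
l1_demo = l1_relative_error(original, reconstructed)
l2_demo = l2_relative_error(original, reconstructed)
cos_demo = cosine_similarity(original, reconstructed)
```

A perfect reconstruction gives a cosine similarity and Pearson correlation of one and a divergence and relative errors of zero.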